The Rocchio algorithm is a relevance feedback method for information retrieval systems that stemmed from the SMART Information Retrieval System around 1970. Like many other retrieval systems, the Rocchio feedback approach was developed using the Vector Space Model. The algorithm assumes that most users have a general conception of which documents should be denoted as relevant or non-relevant.〔Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze: ''An Introduction to Information Retrieval'', page 181. Cambridge University Press, 2009.〕 The user's search query is therefore revised to include an arbitrary percentage of relevant and non-relevant documents as a means of increasing the search engine's recall, and possibly its precision as well. The number of relevant and non-relevant documents allowed to enter a query is dictated by the weights ''a'', ''b'' and ''c'' defined in the Algorithm section below.

==Algorithm==
The formula and variable definitions for Rocchio relevance feedback are as follows:

<math>\overrightarrow{Q_m} = \left(a \cdot \overrightarrow{Q_o}\right) + \left(b \cdot \frac{1}{|D_r|} \sum_{\overrightarrow{D_j} \in D_r} \overrightarrow{D_j}\right) - \left(c \cdot \frac{1}{|D_{nr}|} \sum_{\overrightarrow{D_k} \in D_{nr}} \overrightarrow{D_k}\right)</math>

{| class="wikitable"
! Variable
! Value
|-
| <math>\overrightarrow{Q_m}</math>
| Modified Query Vector
|-
| <math>\overrightarrow{Q_o}</math>
| Original Query Vector
|-
| <math>\overrightarrow{D_j}</math>
| Related Document Vector
|-
| <math>\overrightarrow{D_k}</math>
| Non-Related Document Vector
|-
| <math>a</math>
| Original Query Weight
|-
| <math>b</math>
| Related Documents Weight
|-
| <math>c</math>
| Non-Related Documents Weight
|-
| <math>D_r</math>
| Set of Related Documents
|-
| <math>D_{nr}</math>
| Set of Non-Related Documents
|}

As the Rocchio formula shows, the weights (''a'', ''b'', ''c'') shape the modified vector, moving it closer to, or farther away from, the original query, the related documents, and the non-related documents. In particular, the values of ''b'' and ''c'' should be incremented or decremented proportionally to the set of documents classified by the user. If the user decides that the modified query should not contain terms from the original query, the related documents, or the non-related documents, then the corresponding weight (''a'', ''b'' or ''c'') should be set to 0.
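The update described by the Rocchio formula can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `centroid` and `rocchio`, the plain-list vector representation, the example vectors, and the weight values are all assumptions made for this sketch. Treating an empty document set as contributing the zero vector is also an assumption, since the formula itself is undefined when a set is empty.

```python
from typing import List, Sequence

Vector = Sequence[float]


def centroid(docs: Sequence[Vector], dim: int) -> List[float]:
    """Sum the document vectors and divide by the size of the set.

    Assumption of this sketch: an empty set yields the zero vector.
    """
    if not docs:
        return [0.0] * dim
    return [sum(doc[i] for doc in docs) / len(docs) for i in range(dim)]


def rocchio(q_o: Vector, related: Sequence[Vector],
            non_related: Sequence[Vector],
            a: float, b: float, c: float) -> List[float]:
    """Return the modified query a*Q_o + b*centroid(D_r) - c*centroid(D_nr)."""
    dim = len(q_o)
    rel = centroid(related, dim)
    non = centroid(non_related, dim)
    return [a * q_o[i] + b * rel[i] - c * non[i] for i in range(dim)]


# Illustrative three-term vocabulary; all values are made up.
q_o = [1.0, 0.0, 0.0]
related = [[1.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
non_related = [[0.0, 0.0, 2.0]]

q_m = rocchio(q_o, related, non_related, a=1.0, b=0.5, c=0.25)
print(q_m)  # [1.25, 0.5, -0.5]

# Setting c to 0 removes the non-related documents' contribution entirely.
print(rocchio(q_o, related, non_related, a=1.0, b=0.5, c=0.0))  # [1.25, 0.5, 0.0]
```

In this example the second term gains weight because both related documents contain it, while the third term is pushed negative because it appears only in the non-related document.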
In the later part of the algorithm, the variables <math>D_r</math> and <math>D_{nr}</math> are sets of vectors containing the coordinates of related documents and non-related documents. Although <math>D_r</math> and <math>D_{nr}</math> are not vectors themselves, <math>\overrightarrow{D_j}</math> and <math>\overrightarrow{D_k}</math> are the vectors used to iterate through the two sets and form vector summations. These sums are normalized (divided) by the size of their respective document set (<math>|D_r|</math>, <math>|D_{nr}|</math>). To visualize the changes taking place on the modified vector, refer to the accompanying image.

As the weight for a particular category of documents is increased or decreased, the coordinates of the modified vector move closer to, or farther away from, the centroid of that category's documents. Thus, if the weight for related documents is increased, the modified vector's coordinates move closer to the centroid of the related documents.
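The centroid behaviour described above can be checked numerically. The following is a rough sketch under stated assumptions: closeness is measured by cosine similarity (a common choice in the Vector Space Model, but not one the text prescribes), the non-related term is left out (''c'' = 0), and the helper names `cosine`, `mean_vector` and `modified_query`, along with the example vectors, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)


def mean_vector(docs):
    """Vector sum of a non-empty document set divided by the set's size."""
    return [sum(column) / len(docs) for column in zip(*docs)]


def modified_query(q_o, related, b, a=1.0):
    """a*Q_o + b*centroid(D_r); the non-related term is omitted (c = 0)."""
    rel = mean_vector(related)
    return [a * q + b * r for q, r in zip(q_o, rel)]


# Illustrative two-term vocabulary; all values are made up.
q_o = [1.0, 0.0]
related = [[0.0, 1.0], [0.0, 3.0]]
centroid_r = mean_vector(related)  # [0.0, 2.0]

# Raise the related-documents weight b and measure how closely the
# modified query points toward the related centroid.
similarities = [
    cosine(modified_query(q_o, related, b), centroid_r)
    for b in (0.0, 0.5, 1.0, 2.0)
]
print(similarities)
```

With these inputs the similarity starts at 0.0 for ''b'' = 0, where the query is orthogonal to the centroid, and rises with each larger value of ''b''. Cosine similarity is used rather than Euclidean distance because a large ''b'' can push the modified vector past the centroid in distance while its direction still converges on it.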